SUMMARY


During the last year significant progress has been made toward the primary objective of estimating the
acoustic characteristics of speech from the visual speech signals. Neural networks have been trained on
a database of vowels. The raw images of faces, aligned and preprocessed, were used as input to these
networks, which were trained to estimate the corresponding envelope of the acoustic spectrum. The
performance of the networks was better than that of trained humans and was comparable with that of
optimized pattern classifiers. Our approach avoids the problem of information loss through early
categorization. The acoustic information that the network extracts from the visual signal can be used
to supplement the acoustic signal in noisy environments, such as cockpits. During the next year we plan
to extend these results to diphthongs using recurrent neural networks and temporal sequences of input
images.
INTRODUCTION

Speaking produces acoustic and visual signals. When the acoustic speech signal is degraded by
noise, the visual signal can provide supplemental speech information that improves speech perception
(Sumby and Pollack, 1954; Ewertsen and Nielsen, 1971; Erber, 1975). When the acoustic signal is
totally unavailable, as with the profoundly deaf, the visual signal can be used through lip reading
(Montgomery, 1983; Summerfield, 1979; Demorest, Bernstein and Eberhardt, 1987).

In this project, neural networks are being used to process visual speech signals in order to study
the feasibility of obtaining acoustic constraints directly from the raw visual image. At present,
automatic speech recognition systems rely almost exclusively on the acoustic speech signal. This
contributes to the poor performance that these systems often demonstrate in noisy environments
(Allen, 1985). While some effort has been made at cleaning up the acoustic input to these systems, few
have attempted to use the visual information to supplement the acoustic signal.

The  only  speech  recognition  system  that  has  extensively  used  the  visual  signals  was  developed  by 
Eric  Petajan  of  Bell  Labs  (Petajan,  1984,  1987).  This  system  uses  stored  templates  to  identify 
sequences  of  lip  images.  On  a  limited  vocabulary,  Petajan  was  able  to  demonstrate  that  using  the  visual 
speech  signals  can  significantly  improve  speech  recognition  over  acoustic  recognition  alone. 

The computational constraints imposed by serial, digital computers often make it necessary to
encode stored templates. The von Neumann bottleneck between the memory and the processors
requires that the incoming images and the stored templates have a reduced dimensionality that
minimizes the necessary computation. In his system, Petajan uses vector quantization to construct a
codebook that is used to translate incoming image sequences into symbol strings. The question is
whether the encoding process preserves the relevant information. Petajan found that encoding the
images resulted in poorer performance, suggesting that there was a loss of information.
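
The encoding step described above can be illustrated with a minimal sketch of vector quantization. This
is not Petajan's implementation; the function name, the squared Euclidean distance, and the toy
two-entry codebook are assumptions made only for illustration.

```python
def quantize(sequence, codebook):
    """Map each image vector to the index of its nearest codebook entry."""
    symbols = []
    for frame in sequence:
        # Squared Euclidean distance to every codebook vector (assumed metric).
        distances = [sum((c - f) ** 2 for c, f in zip(code, frame))
                     for code in codebook]
        symbols.append(distances.index(min(distances)))
    return symbols


# Hypothetical two-entry codebook and a three-frame "image" sequence.
codebook = [[0.0, 0.0], [10.0, 10.0]]
symbols = quantize([[1.0, 1.0], [9.0, 8.0], [0.0, 2.0]], codebook)
```

In this toy case the first and third frames differ but receive the same symbol, which is the kind of
information loss the paragraph above is concerned with.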

Neural networks are an alternative architecture characterized by many interconnected
processors that perform their computation in parallel. This architecture offers new approaches to
signal processing by eliminating the need to immediately encode signals into a lower dimension.

THE VISUAL AND ACOUSTIC SIGNALS OF SPEECH


In linguistics, continuous speech signals are traditionally treated as a sequence of discrete
components. Phonemes are the shortest acoustically distinguishing units of a given language. For
example, the words beat and neat are distinguished by the phonemes [b] and [n]. Similarly, boot and
beat are distinguished by the phonemes [u] and [i], which are abstractions corresponding to the ‘oo’ and
‘ea’ sounds in the two words. The sounds themselves are identified phonetically as /u/ and /i/ to
distinguish them from the linguistic abstractions [u] and [i].

The visual correlate of the phoneme is the viseme. The viseme is the smallest visibly
distinguishing unit of a given language (Fisher, 1968). The mapping between phonemes and visemes is
generally many to one; for example, the phonemes [p], [b] and [m] are usually visibly indistinguishable
and are treated as a single viseme.

Speech recognition research has been largely preoccupied with trying to find a reliable method
of translating the continuous acoustic signal into a corresponding phonemic sequence (Reddy, 1966,
1967). This effort has been plagued by problems in segmenting the continuous signals and in
identifying the resulting segments. Recently, however, the most successful speech recognition systems
have avoided this procedure altogether (Jelinek, 1985). These results suggest that alternative
approaches using the visual signal should also be explored.

The acoustic speech signal that is emitted from the mouth has long been modeled as the
response of the vocal tract filter to a sound source (Fant, 1960; Flanagan, 1972). In this first-order
model, it is the configuration of the articulators that defines the vocal tract filter’s shape and its
corresponding resonance characteristics. These resonance characteristics are represented in the
amplitude envelope of the acoustic wave’s short-term power spectrum.

While some of the articulators are visible on the face of the speaker (e.g., the lips, teeth and
sometimes the tongue), others are not. The visible articulators’ contributions to the acoustic signal
result in speech sounds that are much more susceptible to acoustic noise distortion than are the
contributions of the hidden articulators (Petajan, 1987). As a result, the visual speech signal tends to
complement the acoustic signal. Thus, those speech sounds that are the most visibly distinct, such as
/b/ and /k/, are among the first pairs to be confused when presented acoustically in the presence of
noise. Similarly, those phonetic segments that are visibly indistinguishable, such as /p/, /b/ and /m/,
are among the most resistant to confusion when presented acoustically (Miller and Nicely, 1955;
Walden et al., 1977). This complementary structure serves as the basis on which the two signals can
interact to improve the perception of speech in noise.

RESEARCH  OBJECTIVES 

The focus of this research is to study the feasibility of using the visual speech signal to define
acoustic constraints. The approach has been to try to estimate the short-term power spectral
envelope of the acoustic signal from the corresponding visual signals on the face of the speaker. The
transfer function of the vocal tract can then be described from this spectral envelope.

Estimates of the transfer function are obtained using a variety of neural network architectures.
These  estimates  are  then  evaluated  and  compared  with  results  from  other  estimation  techniques. 

The neural network architectures that we are working with are simulated on an ANALOGIC
AP5000 array processor. These architectures are being implemented in special parallel hardware
and should eventually be readily available.

NEURAL  NETWORK  BACKGROUND 

Neural networks compute using many interconnected processors that individually perform a
simple transformation of their summed inputs. The connections between processors have weights
associated with them, and signals traveling along those connections are multiplied by those weights.
The network’s response to a particular input is based on the exchange of signals between processors
across these weighted connections. A particular response can be programmed by specifying the
connections and their associated weights. Introductions to neural networks can be found in the April
1987 IEEE ASSP Magazine (Lippmann, 1987) and in Rumelhart, McClelland, and the PDP Research
Group (1986).
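
The computation performed by each processor can be sketched in a few lines. The logistic squashing
function, the bias term, and the function names below are assumptions chosen for illustration; several
transfer functions were tried in this work, and the sketch does not claim to reproduce any one of them.

```python
import math


def unit_output(inputs, weights, bias=0.0):
    """One processor: a simple transformation of its weighted, summed inputs."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # logistic function (assumed)


def layer_output(inputs, weight_rows, biases):
    """A layer is a set of processors that all receive the same inputs."""
    return [unit_output(inputs, row, b) for row, b in zip(weight_rows, biases)]


# Two inputs feeding two processors with hand-picked (hypothetical) weights.
hidden = layer_output([0.2, 0.8], [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
```

Changing the weight rows changes the response to the same input, which is the sense in which the
network is programmed through its connections.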

By defining the correct weights, these networks can be constructed to solve a variety of
problems (Hopfield and Tank, 1986; Marr and Poggio, 1976). However, until recently defining these
weights was an arduous task that required an a priori solution to the underlying problem. Now
algorithms exist that iteratively adjust these weights given a set of examples (Pineda, 1987; Rumelhart
et al., 1985). The ability to automatically program these networks has resulted in a flurry of
experimental work aimed at demonstrating the computational power of this architecture. Using these
algorithms, networks have already demonstrated their ability to find solutions to a variety of problems
(Lippmann, 1987; Sejnowski and Gorman, 1988; Sejnowski and Rosenberg, 1987; Prager et al., 1986).

One of the areas in which neural networks are strongest is in solving ill-posed problems, where a
solution may not exist or may not be unique (Poggio et al., 1985). Estimating acoustic structure from
visual speech signals is such an ill-posed problem. The visual signals provide only a partial description
of the vocal tract transfer function, and that which is described is ambiguous. For a given visual signal
there are many possible corresponding acoustic structures. What we want is a good estimate based upon
known examples.

THE  APPROACH 

A variety of neural network architectures were trained to approximate the acoustic signal’s power
spectrum envelope given the corresponding visual signal as input. Since the system was given single
isolated video images, each of which corresponds to 33 ms of speech, it was necessary to choose data
from vowels and diphthongs, which are relatively steady state over periods of 50 to 80 ms.

The input signal structure was chosen to exploit the distributed representations that neural
networks allow. The video input signals are extracted from recordings of full-faced, well-illuminated
speakers preserved on laser disc (Bernstein and Eberhardt, 1987). Software was written to
automatically define a box centered about the mouth and extract that portion of the image. The
computational load of simulating a neural network on a serial machine made it necessary to further
sub-sample these images, reducing the input to 500 pixels. It is important to emphasize that the
particular encoding we have chosen is an artifact of our simulation. Once these networks are
implemented in hardware, visual images can be received in parallel across arrays of sensors and then
processed in parallel without sampling.
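
A minimal sketch of the cropping and sub-sampling step follows. The 20 by 25 output grid is an
assumption (the text states only that 500 pixels were used), as are the function name, the box
coordinates, and the choice of taking one pixel per grid cell rather than averaging.

```python
def crop_and_subsample(image, top, left, height, width, out_rows=20, out_cols=25):
    """Sample a regular grid of pixels from a box placed over the mouth.

    image is a list of rows of grey levels; the result is a flat list of
    out_rows * out_cols values suitable as network input.
    """
    pixels = []
    for r in range(out_rows):
        row = top + (r * height) // out_rows
        for c in range(out_cols):
            col = left + (c * width) // out_cols
            pixels.append(image[row][col])
    return pixels


# Synthetic 80 x 100 frame whose grey level encodes its own coordinates.
frame = [[r * 100 + c for c in range(100)] for r in range(80)]
inputs = crop_and_subsample(frame, top=10, left=20, height=40, width=50)
```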

Corresponding to each video frame is 33 ms of acoustic speech. When the visual signal is
presented to the network, the network is asked to produce the amplitude envelope of the 256-point
short-term power spectrum (STPS) of the corresponding acoustic signal. During the training process,
the network is presented with both the input and output structures. To obtain the spectral envelope,
the cepstrum of the STPS was taken and then low-pass liftered below T/4, where T is the original
length of the input signal. Taking the inverse cepstrum of the resulting data provided a smooth
envelope of the original power spectrum that could be sampled down to 32 points.
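
The cepstral smoothing described above can be sketched with NumPy as follows. This is an illustration
under stated assumptions, not the code used in the project: the cutoff is read as T/4 quefrency bins with
T = 256, the envelope is returned on a linear power scale, and the 32 output points are taken by regular
decimation of the non-negative frequencies.

```python
import numpy as np


def spectral_envelope(frame, n_fft=256, n_out=32):
    """Cepstrally smoothed envelope of one frame's short-term power spectrum."""
    spectrum = np.fft.fft(frame, n_fft)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # log STPS, guarded against log(0)
    cepstrum = np.fft.ifft(log_power).real             # real cepstrum
    cutoff = n_fft // 4                                # "below T/4" (assumed reading)
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0                              # keep low quefrencies
    lifter[-(cutoff - 1):] = 1.0                       # and their mirror image
    smooth_log = np.fft.fft(cepstrum * lifter).real    # inverse cepstrum: smoothed log spectrum
    half = smooth_log[: n_fft // 2]                    # non-negative frequencies only
    step = (n_fft // 2) // n_out
    return np.exp(half[::step][:n_out])                # back to a power scale


# Synthetic noise stands in for 33 ms of speech samples.
rng = np.random.default_rng(0)
envelope = spectral_envelope(rng.standard_normal(256))
```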

The networks were trained on 119 images of vowels and diphthongs. The training period
usually involved 450 presentations of each image in the training set. The weights were updated after
all images were presented. For a given network architecture, the particular weights to be evaluated are
those that give the network its best performance on data not in the training set.
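
The training schedule above (one weight update per pass through the training set, with the weights
retained being those that do best on held-out data) can be sketched as follows. A single linear unit
trained by gradient descent on mean squared error stands in for the actual networks, and the learning
rate and toy data are assumptions made for illustration.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mean_sq_error(w, data):
    return sum((predict(w, x) - t) ** 2 for x, t in data) / len(data)


def train_batch(train, held_out, n_inputs, epochs=450, rate=0.05):
    """Batch gradient descent that keeps the weights with the best held-out error."""
    w = [0.0] * n_inputs
    best_w, best_err = list(w), mean_sq_error(w, held_out)
    for _ in range(epochs):
        grad = [0.0] * n_inputs
        for x, t in train:                       # accumulate over the whole set
            err = predict(w, x) - t
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(train)
        w = [wi - rate * gi for wi, gi in zip(w, grad)]  # one update per pass
        val = mean_sq_error(w, held_out)
        if val < best_err:                       # remember the best weights so far
            best_w, best_err = list(w), val
    return best_w, best_err


# Toy data generated by the (hypothetical) target weights [2, -1].
train = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
held_out = [([2.0, 1.0], 3.0)]
best_w, best_err = train_batch(train, held_out, n_inputs=2)
```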

EVALUATION 

The performance of the network is evaluated on two criteria. The first criterion is the weighted
squared error between the spectral envelope produced by the network and the true envelope. The
weighting places greater emphasis on the peaks of the spectral envelope than on the valleys. This is
accomplished by multiplying the squared difference of two points by the amplitude of the greater
of those two points. For a given input image, I_p, a 32-point approximation of the associated acoustic
spectral envelope is produced. Then if the desired value for the i-th component of the spectrum is
t_{ip}, and the approximated value is o_{ip}, we can define an error measure for that approximate
envelope to be

\[
E_p = \sum_{i=1}^{32} \max(t_{ip}, o_{ip}) \, (t_{ip} - o_{ip})^2
\]

This weighting scheme reflects the relative importance of the height of the spectral envelope as
demonstrated by speech recognition tasks (Miller, 1953). This is also the error measure used by many
speech recognition systems to compare an unknown acoustic spectrum to a set of stored templates
(Klatt, 1976; Jelinek, 1985).
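
The weighted squared error described above is simple enough to state directly in code. The function
name is an assumption; the computation follows the verbal definition, in which each squared difference
is multiplied by the greater of the two values being compared.

```python
def weighted_sq_error(target, output):
    """Squared error in which each term is weighted by the larger of the two values."""
    return sum(max(t, o) * (t - o) ** 2 for t, o in zip(target, output))


# A two-point "envelope" with a peak (4.0) and a valley (1.0).
miss_at_peak = weighted_sq_error([4.0, 1.0], [3.0, 1.0])
miss_at_valley = weighted_sq_error([4.0, 1.0], [4.0, 0.0])
```

Both estimates are off by the same amount at one point, but the miss at the peak is penalized four
times as heavily as the miss at the valley.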

The second criterion used to evaluate the network is a forced choice test. In this test, the
spectral estimate obtained from a given image is compared to a set of known spectral envelopes. Based
on the preceding error measure, the closest envelope is identified and selected. A selection is
considered successful if it is the actual spectrum associated with the original image or if it is the
spectrum associated with another example of the same vowel. In addition to the number of correct
matches, the confusions are also of interest since they reveal whether the network is processing the
images in a manner similar to human lip readers. As part of this research, the network’s confusions
will be compared to those made by other estimation methods and by human lip readers.
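
The forced choice test and the tally of confusions can be sketched as follows. The data layout (lists of
labelled envelopes), the function names, and the two-point toy envelopes are assumptions for
illustration; the distance is the peak-weighted squared error described earlier.

```python
from collections import Counter


def weighted_sq_error(target, output):
    return sum(max(t, o) * (t - o) ** 2 for t, o in zip(target, output))


def forced_choice(estimate, references):
    """Return the vowel label of the known envelope closest to the estimate."""
    return min(references, key=lambda ref: weighted_sq_error(ref[1], estimate))[0]


def score(estimates, references):
    """estimates and references are lists of (vowel_label, envelope) pairs."""
    confusions = Counter()
    for true_vowel, envelope in estimates:
        confusions[(true_vowel, forced_choice(envelope, references))] += 1
    correct = sum(n for (true, chosen), n in confusions.items() if true == chosen)
    return correct / len(estimates), confusions


references = [("i", [1.0, 4.0]), ("u", [4.0, 1.0]), ("i", [1.2, 3.8])]
estimates = [("i", [1.1, 3.9]), ("u", [3.0, 2.0]), ("u", [1.0, 3.5])]
accuracy, confusions = score(estimates, references)
```

A choice counts as correct whenever the selected envelope carries the same vowel label, so a match to
another example of the same vowel is scored as a success, as in the text.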

RESULTS  AND  EVALUATION 

Over the last year, we have constructed and trained numerous neural networks. These networks
have allowed us to explore the effects of the connectivity structure, the number of hidden units, and
the type of transfer function used by the processors. In addition to the comparisons among the
different networks, the best network estimates were compared to a variety of other estimation methods.

While it was difficult to assess these estimates from the value of the error function alone, the
forced choice test has provided a tangible measure of performance. When compared to the
performance of humans on the similar task of classifying vowels from the visual signal alone, the
networks performed better. The acoustic envelope estimated by the networks was correctly matched to
the same vowel from 62% to 68% of the time. In comparison, human lip readers have demonstrated in
previous research performance levels of between 49% and 54% (Jackson et al., 1986; Berger, 1970;
Montgomery and Jackson, 1983; Erber, 1979).

The networks’ estimates were also compared with estimates obtained from a template
matching approach used extensively for pattern matching. The images used to train the network were
used as the stored templates. Given a new image, the closest match or matches were found among the
stored images and the corresponding spectral envelopes were averaged together. This average was then
used as the estimate of the spectral envelope associated with that image.

The quality of the estimate depended heavily on how many spectral envelopes were averaged. It
was found that using the five closest images provided the best estimates. Using five envelopes, the
template matching method performed on par with the neural networks.
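
The template matching estimate can be sketched as a nearest-neighbour average. The squared Euclidean
distance between images is an assumption, since the text does not name the image metric, and the
function names and toy data are illustrative only.

```python
def sq_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def template_estimate(image, templates, k=5):
    """Average the envelopes of the k stored images closest to the new image.

    templates is a list of (stored_image, spectral_envelope) pairs.
    """
    nearest = sorted(templates, key=lambda tpl: sq_distance(tpl[0], image))[:k]
    n_points = len(nearest[0][1])
    return [sum(env[i] for _, env in nearest) / len(nearest) for i in range(n_points)]


# One-pixel "images" and two-point "envelopes" keep the arithmetic visible.
templates = [([0.0], [1.0, 1.0]), ([1.0], [3.0, 5.0]), ([10.0], [100.0, 100.0])]
estimate = template_estimate([0.4], templates, k=2)
```

Every new image must be compared against every stored template before ranking, which is the cost
discussed in the following paragraph.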

This alone would not be remarkable, unless one considers the problems involved in
implementing the template method. As the number of templates increases, the computation needed to
do the comparison and ranking increases as O(N^2), where N is the number of pixels in the images. The
only solution is to create a smaller set of templates. However, Petajan’s work (1988) showed that
encoding the images resulted in poorer performance for a similar lip reading task.

WORK  IN  PROGRESS 

In addition to the template matching method, the networks are being compared to other
estimation techniques. Most estimation procedures are designed to work with input and output data of
small dimensionality. As a result, it is often necessary to encode the input and output data before
some type of regression is attempted. The problem is choosing the correct encoding. If the
parameters in the input that are necessary to predict the output are not clearly identified, then one
has two choices.

First, one can select some parameters a priori and test how well the parameters correlate
with the output and account for the variance. Towards this end, we will review those parametric
studies of human lip reading that have been performed (Montgomery and Jackson, 1983). The goal of
these studies was to determine those parameters available in the visual signal that could determine the
vowel being spoken. The comparison will allow us to evaluate how well the network is selecting its
parameters as compared to experts.

The second method is to choose some encoding based upon some known criterion, such as linear
least-square-error (LLSE) encoding (Gonzalez and Wintz, 1977). The optimal LLSE encoding can be
obtained using principal component analysis. Using this encoding, the images and their associated
acoustic spectra will be represented in terms of their principal components. Next, an attempt will be
made to fit a linear mapping from the input data set to the output data set using linear regression.
Once a fit is defined, a new image could be encoded and used to construct an estimate of the
associated spectrum. This estimate will be in terms of the principal components used to describe the
acoustic envelopes in the training set. This method will reveal whether or not an optimal LLSE
encoding will collapse the data along dimensions which are vital to the problem under study. There is
no reason to believe that it won’t.
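
The planned comparison can be sketched with NumPy as principal component encoding of both data sets
followed by a least-squares linear map between the two sets of codes. The function names, the use of the
singular value decomposition to obtain the components, and the synthetic data are assumptions; the
sketch illustrates the procedure and says nothing about its results on real images.

```python
import numpy as np


def pca_basis(data, n_components):
    """Return the mean and the leading principal directions (one per row)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def fit_pca_regression(images, envelopes, n_in, n_out):
    """Encode inputs and outputs by PCA, then fit a linear map between the codes."""
    img_mean, img_basis = pca_basis(images, n_in)
    env_mean, env_basis = pca_basis(envelopes, n_out)
    x = (images - img_mean) @ img_basis.T        # encoded images
    y = (envelopes - env_mean) @ env_basis.T     # encoded envelopes
    coeffs, *_ = np.linalg.lstsq(x, y, rcond=None)

    def estimate(image):
        code = (image - img_mean) @ img_basis.T
        return env_mean + (code @ coeffs) @ env_basis

    return estimate


# Synthetic data with an exactly linear image-to-envelope relation.
rng = np.random.default_rng(0)
images = rng.standard_normal((40, 6))
mixing = rng.standard_normal((6, 4))
envelopes = images @ mixing + 1.0
estimate = fit_pca_regression(images, envelopes, n_in=6, n_out=4)
new_image = rng.standard_normal(6)
```

Keeping every component, as in this toy case, loses nothing; reducing n_in or n_out discards the
low-variance directions, which is where information vital to the mapping could be collapsed.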

One of the benefits of using neural networks is the ease with which additional constraints can be
introduced. In the coming year, we hope to improve the performance of the networks by using the
dynamical constraints inherent in the speech production process. As part of this effort, we intend to
use networks with recurrent links that will allow us to work with sequences of images.

LONG  TERM  IMPLICATIONS 

The approach taken by this research may provide a basis for a new generation of speech
recognition systems that use two sensory channels. In designing such a system, the engineer can benefit
by looking at the best speech recognition system around, the human being. The parts of this system
that are the most fully studied and best understood are the human acoustic and visual preprocessing
systems. Already, acoustic speech recognition systems are benefiting from what is known about the
human auditory system by using models of the human ear as a front end (Jelinek, 1985).

At Caltech, Carver Mead has already successfully designed and fabricated a variety of synthetic
retinas and cochleas in analog VLSI. These peripheral systems process massive amounts of sensory
data in real time, and output a distilled, parallel and analog representation. Given these parallel
outputs from two channels, the question is how to combine them.

The traditional approach would be to encode their outputs symbolically and to try to define
constraints between these two symbol strings. One of the problems with encoding the signals from
these two channels is that the symbolic encoding can obscure constraints that might otherwise be
useful, or quite simply might throw the information away.

The alternative approach is to maintain the distributed representation that comes out of these
channels and attempt to combine them at a sub-symbolic level. This research looks at the feasibility of
this second approach.